A commentary on

Speech through ears and eyes: interfacing 
the senses with the supramodal brain 

by van Wassenhove, V. (2013). Front. 
Psychol. 4:388. doi: 10.3389/fpsyg.2013.00388 

The multimodal nature of perception has generated several questions of
importance pertaining to the encoding, learning, and retrieval of linguistic
representations (e.g., Summerfield, 1987; Altieri et al., 2011; van
Wassenhove, 2013). Historically, many theoretical accounts of speech
perception have been driven by descriptions of auditory encoding; this makes
sense because normal-hearing listeners rely predominantly on the auditory
signal. However, from both evolutionary and empirical standpoints,
comprehensive neurobiological accounts of speech perception must account for
interactions across sensory modalities and the interplay of cross-modal and
articulatory representations. These include auditory, visual, and
somatosensory modalities.

In a recent review, van Wassenhove (2013) discussed key frameworks describing
how visual cues interface with the auditory modality to improve auditory
recognition (Sumby and Pollack, 1954), or otherwise contribute to an illusory
percept for mismatched auditory-visual syllables (McGurk and MacDonald, 1976).
These frameworks encompass multiple levels of analysis. They include higher
cognitive processing models that discuss parallel processing (Altieri and
Townsend, 2011) or the independent extraction of features from the auditory
and visual modalities (Massaro, 1987, Fuzzy Logical Model of Perception), as
well as accounts of early feature encoding (van Wassenhove et al., 2005) and
of encoding/timing at the neural level (Poeppel et al., 2008; Schroeder
et al., 2008).

This commentary on van Wassenhove (2013) examines predictive coding hypotheses
as one theory for how visemes are matched with auditory cues. Crucially, it
emphasizes a hypothesized role for cross-modal neural plasticity and
multisensory learning in reinforcing the sharing of cues across modalities
into adulthood.

PREDICTIVE ENCODING AND FIXED PRIORS

A critical question in speech research concerns how time-variable signals
interface with internal representations to yield a stable percept. Although
speech signals are highly variable (multiple talkers, dialects, etc.), our
percepts appear stable due to dimensionality reduction. This question becomes
even more complex in multisensory speech perception, since we are now dealing
with the issue of how visual speech gestures coalesce with the auditory signal
as the respective signals unfold at different rates and reach cortical areas
at different times. In fact, these signals must co-occur within an optimal
spatio-temporal window to have a significant probability of undergoing
integration (Conrey and Pisoni, 2006; Stevenson et al., 2012).

The predictive coding hypothesis incorporates these aforementioned
observations to describe integration in the following ways: (1) temporally
congruent auditory and visual inputs will be processed by cortical integration
circuitry, (2) internal representations ("fixed Bayesian priors") are compared
and matched against the inputs, and (3) hypotheses about the intended
utterance are actively generated. The EEG study of van Wassenhove et al.
(2005) exemplified key components of the visual predictive coding hypothesis.
When auditory and visual syllables are presented in normal conversational
settings, the visual signal leads the auditory signal by tens or even hundreds
of milliseconds. Thus, featural information in the visual signal constrains
predictions about the content of the auditory signal. The authors showed that
early visual speech information speeds up auditory processing, as evidenced by
temporal facilitation in the early auditory ERPs. This finding was interpreted
as a reduction in the residual error in the auditory signal by the visual
signal. One promising hypothesis is that visual information interacts with the
auditory cortex in such a way that it modulates excitability in auditory
regions via oscillatory phase resetting (Schroeder et al., 2008). Predictive
coding hypotheses may also be extended to account for broad classes of stimuli
including speech and non-speech, and matched and mismatched signals, all of
which have been shown to evoke early ERPs associated with visual prediction
(Stekelenburg and Vroomen, 2007).

FIXED PRIORS 

Hypothetically, visual cues can provide predictive information so long as they
precede the auditory stimulus and provide reliable cues (see Nahorna et al.,
2012).
FIGURE 1 | Inputs interact with noise while evidence for a category (e.g., "ba") accumulates 
toward threshold (y). Once enough information in either modality reaches threshold, a decision is 
made (e.g., "ba" vs. "da"). Visual information interacts with auditory cortical regions (dotted line) 
leading to updated priors. This model does not rule out the possibility that auditory cues can 
reciprocally influence viseme recognition. 



A critical issue pertaining to visual predictive coding, then, relates to the
"rigidity" of the internal rules (fixed priors). In the review, van Wassenhove
(2013) discussed research suggesting the stability of priors/representations
that are innate or otherwise become firmly established during critical
developmental periods (Rosenblum et al., 1997; Lewkowicz, 2000). Lewkowicz
(2000) argued that the abilities to detect multisensory synchrony and to match
"duration and rate" are established early in life. In the domain of speech,
Rosenblum and colleagues have argued that infants are sensitive to the McGurk
effect and also to matched vs. mismatched articulatory movements and speech
sounds.

While these studies suggest some rigidity of priors, I would emphasize that
prior probabilities or "internal rules" remain malleable into adulthood. This
adaptive perspective finds support among Bayesian theorists who argue that
priors are continually updated in light of new evidence. Research indicates
that the ability to detect subtle auditory-visual asynchronies continues to
change even into early adulthood (Hillock et al., 2011). Additionally,
perceptual learning and adaptation techniques can alter priors in such a way
that perceptions of asynchronies are modified via practice (Fujisaki et al.,
2004; Vatakis et al., 2007; Powers et al., 2009) or experience with a second
language (Navarra et al., 2010). Importantly, continual updating of "fixed"
priors allows adult perceivers to (re)learn, fine-tune, and adapt to
multimodal signals across listening conditions, variable talkers, and
attentional loads. In addition, van Wassenhove (2013) discussed how subjects
can "automatically" match pitch and spatial frequency patterns (Evans and
Treisman, 2010). This certainly shows that subjects can match auditory and
visual information based on prior experience. Altieri et al. (2013) have also
shown that adults can learn to match auditory and visual patterns more
efficiently after only one day of practice. Reaction times and EEG signals
indicated rapid learning and higher integration efficiency after only 1 h of
training, followed by a period of gradual learning that remained stable over
1 week.
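
To make the notion of a continually updated prior concrete, the following is
a minimal illustrative sketch in Python, not a model taken from any of the
studies cited above. It assumes, purely for illustration, that the perceiver's
prior over the expected audio-visual lag is Gaussian and that each exposure
trial supplies one noisy lag observation; the function names and all numerical
values (lags in ms, variances) are hypothetical.

```python
def update_gaussian_prior(prior_mean, prior_var, observation, obs_var):
    """One conjugate normal-normal update.

    Returns the posterior (mean, variance) after a single noisy observation.
    """
    # The gain weights the observation by the relative uncertainty of the prior.
    gain = prior_var / (prior_var + obs_var)
    post_mean = prior_mean + gain * (observation - prior_mean)
    post_var = (1.0 - gain) * prior_var
    return post_mean, post_var


def adapt(prior_mean, prior_var, observations, obs_var):
    """Apply the update sequentially, so each posterior becomes the next prior."""
    mean, var = prior_mean, prior_var
    for x in observations:
        mean, var = update_gaussian_prior(mean, var, x, obs_var)
    return mean, var


if __name__ == "__main__":
    # Hypothetical values: the prior expects synchrony (0 ms lag, SD 50 ms);
    # exposure consists of 20 trials with a constant 200 ms lag (sensory SD 100 ms).
    exposure = [200.0] * 20
    mean, var = adapt(0.0, 50.0 ** 2, exposure, 100.0 ** 2)
    print(f"expected lag after exposure: {mean:.1f} ms (SD {var ** 0.5:.1f} ms)")
```

Under these assumptions the expected lag drifts toward the exposed lag and the
prior narrows with each trial, which is one simple way to read the claim that
"fixed" priors can be reweighted through practice.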

Such findings appear consistent with a unified parallel framework where visual
information influences auditory processing and where visual predictability can
be reweighted through learning. Figure 1 represents an attempt to couch
predictive coding within adaptive parallel accounts of integration.
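
One loose way to read the architecture in Figure 1 is as a race between two
noisy accumulators, with the visual channel feeding the auditory one. The
Python sketch below is only an illustration under stated assumptions: the
coupling term (visual evidence adding to the auditory drift rate) is one
possible stand-in for the dotted line in the figure, and every name and
parameter value (`simulate_trial`, `mean_rt`, drift rates, threshold, noise
level) is hypothetical rather than fitted to data.

```python
import random


def simulate_trial(rng, aud_drift, vis_drift, coupling=0.0, threshold=1.0,
                   noise_sd=0.3, dt=0.005, max_time=5.0):
    """Simulate one trial of a two-channel race and return the decision time (s).

    A decision is made as soon as either accumulator reaches threshold.
    Trials that never reach threshold are censored at max_time.
    """
    aud = 0.0
    vis = 0.0
    sqrt_dt = dt ** 0.5
    n_steps = int(round(max_time / dt))
    for step in range(1, n_steps + 1):
        # Visual channel: constant drift plus Gaussian noise.
        vis += vis_drift * dt + noise_sd * sqrt_dt * rng.gauss(0.0, 1.0)
        # Auditory channel: its drift is boosted by accumulated visual evidence
        # (assumed form of the visual-to-auditory influence).
        boosted_drift = aud_drift + coupling * max(vis, 0.0)
        aud += boosted_drift * dt + noise_sd * sqrt_dt * rng.gauss(0.0, 1.0)
        if aud >= threshold or vis >= threshold:
            return step * dt
    return max_time


def mean_rt(n_trials, seed, coupling, aud_drift=1.0, vis_drift=0.5):
    """Mean decision time over n_trials; trial i always uses seed + i."""
    total = 0.0
    for i in range(n_trials):
        rng = random.Random(seed + i)  # same noise for every coupling value
        total += simulate_trial(rng, aud_drift, vis_drift, coupling=coupling)
    return total / n_trials


if __name__ == "__main__":
    print("mean decision time, no coupling:  ", mean_rt(200, seed=1, coupling=0.0))
    print("mean decision time, with coupling:", mean_rt(200, seed=1, coupling=2.0))
```

Because each trial reuses the same noise sequence across coupling values, any
difference in mean decision time in this sketch reflects the coupling term
alone. Learning could then be represented by changing `coupling` across
sessions, although that step is not implemented here.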